Abstract

In this paper, we are interested in optimal decisions in a partially 
observable universe. Our approach is to directly approximate an opti- 
mal strategic tree depending on the observation. This approximation 
is made by means of a parameterized probabilistic law. A particular 
family of hidden Markov models, with input and output, is consid- 
ered as a model of policy. A method for optimizing the parameters 
of these HMMs is proposed and applied. This optimization is based 
on the cross-entropic principle for rare events simulation developed by 
Rubinstein. 

Keywords: Control, MDP/POMDP, Hierarchical HMM, Bayesian Networks, Cross- 
Entropy 

Notations. Some specific notations are used in this document. 

• The variables d, y, x and m are used for the decision, observation, 
world state and machine memory, 

• The time t runs from stage 1 to the maximal stage T. Variables
with a subscript outside this scope are synonymous with ∅. For example,
$\prod_{t=1}^{T} \pi(x_t|x_{t-1})$ means $\pi(x_1|\emptyset) \prod_{t=2}^{T} \pi(x_t|x_{t-1})$, i.e. a Markov chain.
A similar principle is used for the level superscript λ in the definition of
the hierarchical HMM,



• The generic notation for a probability is P. However, the functions
p, π and h denote some specific components of the probability: p is
the law of the observation y and state x conditionally on the decision
d; π is a stochastic policy, i.e. a law of the decision conditionally on
the observation; h is an approximation of π by an HMM family. The
hidden state of h is defined as the machine memory m.

1 Introduction 

Planning and control problems come with different degrees of difficulty. In
most problems, the planner has to start from a given state and terminate
in a required final state. Several transition rules constrain
the sequence of decisions. For example, a robot may be required to move
from room A (the starting state) to room B (the final state); its decision could be
go forward, turn right or turn left, and it cannot cross a wall; these are the
constraints on the decision. A first degree of difficulty is to find at least
one solution to the planning problem. When the states are only partially known or
the results of the actions are not deterministic, the difficulty increases markedly:
the planner has to take into account the various observations. The
problem becomes much more complex when the plan is required to be
optimal or near-optimal, for example when looking for the shortest trajectory which
moves the robot from room A to room B. There are again different degrees
of difficulty, depending on whether the problem is deterministic or not and
on the model of the future observations. In the particular case
of a Markovian problem with the full observation hypothesis, the dynamic
programming principle can be applied efficiently (Markov Decision
Process theory/MDP). This solution has been extended to the case of partial
observation (Partially Observable Markov Decision Process/POMDP), but 
this solution is generally not practicable, owing to the huge dimension of the 
variables [Pl)1I3|. 

For this reason, different methods for approximating this problem have been
introduced. For example, Reinforcement Learning (RL) methods are able
to learn an evaluation table of the decisions conditionally on the known
universe states and a short range of observations. In this case, the range of
observations is indeed limited in time, because of the exponential growth of the
table to learn. Recent works investigate hierarchical
RL in order to go beyond this range limitation. However, these methods
are generally based on an additivity hypothesis about the reward. Another
viewpoint is based on the direct learning of the policy. Our approach
is of this kind. It is based in particular on the Cross-Entropy optimization
algorithm developed by Rubinstein. This simulation method relies both
on a probabilistic modelling of the policies (in this paper, these models are
Bayesian Networks) and on an efficient and robust iterative algorithm for
optimizing the model parameters. More precisely, the policy will be modelled
by a conditional probabilistic law, i.e. decisions depending on observations,
which involves memories; typically, hidden Markov models are used.
A hierarchical modelling of the policies is also implemented, by means of
hierarchical hidden Markov models.

The next section introduces some formalism and gives a quick description of
optimal planning in partially observable universes. A near-optimal
planning method is proposed, based on the direct approximation of the optimal
decision tree. The third section introduces the family of Hierarchical Hidden
Markov Models used for approximating the decision trees. The
fourth section describes the method for optimizing the parameters of the
HHMM, in order to approximate the optimal decision tree for the POMDP
problem. The cross-entropy method is described and applied. The fifth section
gives an example of application and makes a comparison with a Reinforcement
Learning method, Q-learning. The paper is then concluded.

2 Decision in a partially observable universe 

It is assumed that a subject is acting in a given world with a given purpose
or mission. Thus, the subject interacts with the world and perceives partial
information. The goal is to optimize the accomplishment of the mission,
which is characterized by its reward. The forthcoming paragraphs
formalize what a world is, what a mission reward is, and how
an optimal policy is defined for such a mission.

The world. The world is described by a hidden state x, which evolves
with the time t; in this paper, the time is discretized and increases from
step 1 to step T. More specifically, the variable $x_t$ contains the information
which entirely characterizes the world at time t. In the example
of section 5, the hidden state is characterized by the locations of the target
and patrols. The evolution of the hidden state is given by the vector
$x = x_{1:T} = x_1, \dots, x_t, \dots, x_T$. During the mission, the subject produces
decisions $d = d_{1:T}$ which impact the evolution of the world. In the example,
d is the move of the patrols. The subject perceives partial observations
from the world, denoted $y = y_{1:T}$, which are noisily derived from the hidden
state. In the example, this observation is an inaccurate estimate of the
target location. In conclusion, the world is characterized by a law describing
the hidden states and observations conditionally on the decisions. This
probabilistic law is denoted P:

The hidden state $x_t$ and observation $y_t$ are obtained from the
law $P(x_t, y_t|x_{1:t-1}, y_{1:t-1}, d_{1:t-1})$, which is conditioned by the
past hidden states, observations and decisions. It is assumed
that $d_t$ is generated by the subject after receiving $y_t$.

In this paper, the law P is quite general; for example, there is no Markovian
hypothesis (this hypothesis is required for a dynamic programming
approach). Nevertheless, it is assumed that $P(x_t, y_t|x_{1:t-1}, y_{1:t-1}, d_{1:t-1})$ may be
sampled very quickly. The law P(x,y|d) is illustrated by figure 1. In this
figure, the outgoing arrows are related to the data produced by the world,
i.e. the observations, while the incoming arrows are for the data consumed by
the world, i.e. the decisions. The variables are put in chronological order
from left to right: $y_t$ happens before $d_t$ since decision $d_t$ is produced after
observing $y_t$. From now on, P(x,y|d) denotes the law of the world for the
completed mission:

$$P(x,y|d) = \prod_{t=1}^{T} P(x_t, y_t|x_{1:t-1}, y_{1:t-1}, d_{1:t-1}) .$$

Figure 1: The world

Reward and optimal planning. The mission is limited in time and is
characterized by a reward. This reward, denoted V(d,y,x), is a function of
the trajectories d, y, x. Typically, the function V could be used for computing
the time needed for the mission accomplishment. The only hypothesis
about V is that it is quickly computable. In particular, the additivity of the
reward¹ with time, a hypothesis required by many classical methods, is
not necessary.

The purpose is to construct an optimal decision tree $y \mapsto (d_t(y_{1:t})|_{t=1}^{T})$,
depending on the past observations, in order to maximize the mean reward:

$$d_o \in \arg\max_{d} \sum_{y} \sum_{x} P\big(x, y \,\big|\, (d_t(y_{1:t})|_{t=1}^{T})\big)\, V\big((d_t(y_{1:t})|_{t=1}^{T}), y, x\big) . \quad (1)$$

This optimization process is illustrated by figure 2. The double arrows are
related to the variables to be optimized. These arrows describe the information
flow between observations and decisions. The cells denoted ∞
make decisions and transmit all the received and generated information.
This architecture illustrates that planning with observation is a
non-finite memory problem: the decision depends on the whole of the past
observations. Since the optimum for such a problem is generally intractable, it
is necessary to search for near-optimal solutions. The alternative method
proposed now relies on the optimal tuning of a probabilistic model of the
policies.

Figure 2: The optimization process

Approximating the decision tree. In a program like (1), the variable
to be optimized, $d_o$, is a deterministic object. In this precise case, $d_o$ is a
decision tree, that is a function which maps any sequence of
observations $y_{1:t}$ to a decision $d_t$. But it is more interesting to have a
probabilistic viewpoint when approximating. Then the problem is equivalent to
finding π(d|y), a probabilistic law of the decisions conditionally on the past
observations, which maximizes the mean reward:

$$V(\pi) = \sum_{d} \sum_{y} \sum_{x} \prod_{t=1}^{T} \pi(d_t|d_{1:t-1}, y_{1:t})\, P(x,y|d)\, V(d,y,x) .$$

¹ Additive rewards are of the form $V(d,y,x) = \sum_{t=1}^{T} V_t(d_t, y_t, x_t)$.

This new problem is still illustrated by figure 2, but the double arrows
now describe a Bayesian network structure for the law π. Incidentally,
there is not a great difference with the deterministic case for the optimum:
when $d_o$ is unique, the optimal law $\pi_o \in \arg\max_{\pi} V(\pi)$ is a Dirac on $d_o$.
However, the probabilistic viewpoint is more suitable for an approximation:
it is simpler to handle probabilistic models than deterministic decision trees,
and the optimization is ensured to be continuous; moreover, a natural
approximation of π is obtained by replacing the non-finite memories ∞ by
finite memories m; cf. figure 3. Restricting the memory size of the policies
is equivalent to approximating the law π by a hidden Markov Model. Thus,
the approach developed in this paper is quite general and can be split up
into two processes:

• Define a family of parameterized HMMs $\mathcal{H}$,

• Optimize the parameters of the HMM in order to maximize the mean
reward:

Find $h_o \in \arg\max_{h \in \mathcal{H}} V(h)$ .

Figure 3: Finite-memory approximation

As will be seen later, it is easy to tune an HMM optimally by the
Cross-Entropy method of Rubinstein [9]. But first, the next section discusses
the choice of the family $\mathcal{H}$.

3 Models 

General points. The choice of the family of policy models, $\mathcal{H}$, will
profoundly impact the efficiency of the approximation. In particular, the models
will be characterized by the memory size and the internal structure of the
HMMs (e.g. is it hierarchical or not?). Both characteristics will act upon
the convergence, as will be seen in the experiments. In the simplest case,
the HMMs of $\mathcal{H}$ contain no structure and are distinguished by their memory
size only. Example of a simple HMM:

Let M be a finite set of states, describing the memory
capacity of our models. Then, the memory of the HMM at time
t is $m_t \in M$, a variable valued within M. An HMM $h \in \mathcal{H}$ is thus
typically defined by:

$$h(d|y) = \sum_{m \in M^T} h(d,m|y) ,$$

$$h(d,m|y) = \prod_{t=1}^{T} h_d(d_t|m_t)\, h_m(m_t|y_t, m_{t-1}) ,$$

where the conditional laws $h_d$ and $h_m$ are time invariant.
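
As an illustration of this factorization, the following Python sketch samples decisions and memories from a simple HMM policy, stage by stage. It is only a sketch under assumptions which are not part of the paper: the laws $h_d$ and $h_m$ are stored as dictionaries of probability lists, the empty memory before stage 1 is encoded by `None`, and the names `sample_hmm_policy` and `flat_hmm` are hypothetical.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index according to the probability list `probs`."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point rounding

def sample_hmm_policy(h_d, h_m, observations, rng):
    """Sample (d, m) from h(d, m | y) for a simple HMM policy.

    h_m[(y_t, m_prev)] is the law of m_t and h_d[m_t] the law of d_t.
    Assumption: the empty memory before stage 1 is encoded by None.
    """
    m_prev, decisions, memories = None, [], []
    for y in observations:
        m = sample_categorical(h_m[(y, m_prev)], rng)  # memory update
        d = sample_categorical(h_d[m], rng)            # decision from memory
        decisions.append(d)
        memories.append(m)
        m_prev = m
    return decisions, memories

def flat_hmm(n_mem, n_obs, n_dec):
    """Build uniform ("flat") time-invariant laws h_d and h_m."""
    h_d = {m: [1.0 / n_dec] * n_dec for m in range(n_mem)}
    h_m = {(y, m_prev): [1.0 / n_mem] * n_mem
           for y in range(n_obs) for m_prev in [None, *range(n_mem)]}
    return h_d, h_m
```

The same two tables are reused at every stage, which is what time invariance means here; only the memory value carried from one stage to the next changes.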

Subsequently, the impact of both the memory and the
HMM structures will be considered. For this purpose, a specific family of
hierarchical HMMs will be introduced and studied. HHMMs are indeed a
particular case of HMM, implementing strong internal structures.

Hierarchical HMM. Hierarchical models are inspired by biology: to
solve a complex problem, factorize it and make decisions in a hierarchical
fashion. Low levels manipulate low-level information and actions,
making short-term decisions. High levels manipulate high-level
information and actions (with less uncertainty), making long-term decisions.
Hierarchical HMMs are models of this kind. A hierarchical hidden Markov
model (HHMM) is an HMM whose output is either a hierarchical HMM or
an actual output. An HHMM could also be considered as a hierarchy of
stochastic processes calling sub-processes. From this common definition,
HHMMs are complex structures, which are difficult to formalize and to
implement. Nevertheless, these models have been introduced and applied for
handwriting recognition, as well as for modelling complex worlds in control
applications. A fundamental contribution has been made by Murphy
and Paskin, who have shown how an HHMM can be interpreted as a particular
2-dimensional dynamic Bayesian Network (DBN). Now, dynamic Bayesian
Networks are easily formalized, manipulated and implemented. DBNs could
be considered as HMMs with complex internal structures. From the work of
Murphy and Paskin, it can be shown that a hierarchical HMM (with input
and output) can be interpreted as a DBN as described in figure 4, with
discrete or semi-continuous states. It appears that there is an up and down
flow of the information between the hierarchical levels, in addition to the
usual temporal flow (the Markovian property). It is important to note that
boolean variables are necessary for implementing the hierarchy. These
booleans are needed for controlling the information flows between processes
and subprocesses. The next paragraph introduces the customized model of
HHMM which has been considered in this work. It is a simplification of the
general HHMM model, and it allows a simpler implementation.

Figure 4: Model of a controlled Hierarchical HMM
(legend: information+boolean / output / input)

Implemented model. The implemented model family $\mathcal{H}$ is composed
of HHMMs with Λ hierarchical levels. Each level $\lambda \in [1, \Lambda]$ is associated
with a finite memory set $M^\lambda$ (the memory size may change with the
hierarchical level). The exchange of information between the levels is
characterized by the DBN illustrated in figure 5. Notice that each memory cell
receives information from the current upper-level cell and the previous
lower-level cell. As a consequence, the hierarchical and temporal information
exchanges are guaranteed. More formally, the HHMMs $h \in \mathcal{H}$ are of the form:

$$h(d|y) = \sum_{m \in (M^1 \times \cdots \times M^\Lambda)^T} h(d,m|y) ,$$

$$h(d,m|y) = \prod_{t=1}^{T} h^0(d_t|m^1_t)\, h^1(m^1_t|y_t, m^2_t) \prod_{\lambda=2}^{\Lambda} h^\lambda(m^\lambda_t|m^{\lambda-1}_{t-1}, m^{\lambda+1}_t) ,$$

where $m^\lambda \in M^\lambda$ is the variable for the memory at level λ. It is noteworthy
that this model is equivalent to a simple HMM when Λ = 2. And when
Λ = 1, the law h just maps the immediate observation to decisions, without
any memory of the past observations.
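
The factorization above suggests a top-down sampling order within each stage: the highest level first, then each lower level conditionally on the level above it and on the previous memory of the level below it, and finally the decision. The Python sketch below follows this order. The data layout (a list `tables` of dictionaries, `None` for any memory outside the scope of levels or stages) and the names `sample_hhmm_step` and `flat_hhmm` are assumptions made for the illustration, not the implementation used in the experiments.

```python
import random

def draw(dist, rng):
    """Draw a value from `dist`, a dict mapping value -> probability."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values])[0]

def sample_hhmm_step(tables, y_t, prev, rng):
    """Sample one stage of the HHMM policy.

    tables[0][m1]                -> law of the decision d_t
    tables[1][(y_t, m2)]         -> law of the level-1 memory
    tables[lam][(lower_prev, upper)] -> law of the level-lam memory, lam >= 2
    prev[lam] is the level-lam memory at the previous stage (None at stage 1).
    """
    n_levels = len(tables) - 1
    cur = [None] * (n_levels + 2)        # cur[n_levels + 1] stays None (empty)
    for lam in range(n_levels, 1, -1):   # levels n_levels, ..., 2 (top-down)
        cur[lam] = draw(tables[lam][(prev[lam - 1], cur[lam + 1])], rng)
    cur[1] = draw(tables[1][(y_t, cur[2])], rng)
    d_t = draw(tables[0][cur[1]], rng)
    return d_t, cur

def flat_hhmm(sizes, n_obs, n_dec):
    """Build uniform tables for memory sizes `sizes` = [card(M^1), ...]."""
    n_levels = len(sizes)

    def uniform(n):
        return {v: 1.0 / n for v in range(n)}

    def dom(lam):                        # values of level lam, None if absent
        return [None] if lam > n_levels else list(range(sizes[lam - 1]))

    tables = [{m1: uniform(n_dec) for m1 in dom(1)}]
    tables.append({(y, m2): uniform(sizes[0])
                   for y in range(n_obs) for m2 in dom(2)})
    for lam in range(2, n_levels + 1):
        tables.append({(lo, up): uniform(sizes[lam - 1])
                       for lo in [None] + dom(lam - 1) for up in dom(lam + 1)})
    return tables
```

A trajectory is obtained by calling `sample_hhmm_step` once per stage, starting from `prev = [None] * (len(sizes) + 2)` and feeding back the returned memories.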

For any $h \in \mathcal{H}$, define P[h], the complete probabilistic law of the system
world/subject:

$$P[h](d,y,x,m) = P(y,x|d)\, h(d,m|y) .$$

Figure 5: HHMM model for the planning

Then the issue is to find the near-optimal strategy $h_o \in \mathcal{H}$ such that:

$$h_o \in \arg\max_{h \in \mathcal{H}} \sum_{d,y,x,m} P[h](d,y,x,m)\, V(d,y,x) .$$

A solution to this problem, by means of the cross-entropy method, is
proposed in the next section.

4 Cross-entropic optimization of h 

The reader interested in CE methods should refer to the tutorial and the
book [5] on the CE method. CE algorithms were first dedicated to estimating
the probability of rare events. A slight change of the basic algorithm
made it also suitable for optimization. In a more recent article, Homem-de-Mello
and Rubinstein have given some results about the global convergence. In
order to ensure such convergence, some refinements are introduced,
particularly about the selective rate.
This presentation is restricted to the basic CE method. The new improve- 
ments of the CE algorithm proposed in [H] have not been implemented, but 
the algorithm has been seen to work properly. For this reason, this paper 
does not deal with the choice of the selective rate. 

4.1 General CE algorithm for the optimization

The Cross Entropy algorithm repeats until convergence the three successive 
phases: 

1. Generate samples of random data according to a parameterized ran- 
dom mechanism, 

2. Select the best samples according to a reward criterion, 

3. Update the parameters of the random mechanism, on the basis of the 
selected samples. 

In the particular case of CE, the update in phase 3 is obtained by minimizing
the Kullback-Leibler distance, or cross entropy, between the updated random
mechanism and the selected samples. The next paragraphs describe, on
a theoretical example, how such a method can be used in an optimization
problem.

Formalism. Let a function $x \mapsto f(x)$ be given; this function is easily
computable. The value f(x) has to be maximized by optimizing the choice
of $x \in X$. The function f will be the reward criterion.

Now let a family of probabilistic laws, $P_\sigma|_{\sigma \in \Sigma}$, applying to the
variable x, be given. The family P is the parameterized random mechanism. The
variable x is the random data.

Let $\rho \in\, ]0,1[$ be a selective rate. The CE algorithm for (x, f, P) follows the
synopsis:

1. Initialize $\sigma \in \Sigma$,

2. Generate N samples $x_n$ according to $P_\sigma$,

3. Select the ρN best samples according to the reward criterion f,

4. Update σ as a minimizer of the cross-entropy with the selected samples:

$$\sigma \in \arg\max_{\sigma \in \Sigma} \sum_{n \text{ selected}} \ln P_\sigma(x_n) ,$$

5. Repeat from step 2 until convergence.

This algorithm requires f to be easily computable and the sampling of $P_\sigma$ to
be fast.
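
The five steps above can be sketched in Python on a toy problem. The sketch makes several assumptions that do not come from the paper: the family of laws is a product of independent categorical laws (for which the cross-entropy minimizer of step 4 is simply the empirical frequency among the selected samples), the loop runs for a fixed number of iterations instead of testing convergence, and the function name and default values are hypothetical.

```python
import random

def cross_entropy_maximize(f, n_vars, n_values, n_samples=200, rho=0.2,
                           n_iters=30, seed=0):
    """Basic CE optimization of f over {0..n_values-1}^n_vars.

    Returns the final parameters: one probability list per coordinate.
    """
    rng = random.Random(seed)
    # Step 1: flat initial law, each coordinate uniform over its values.
    probs = [[1.0 / n_values] * n_values for _ in range(n_vars)]
    n_elite = max(1, int(rho * n_samples))
    for _ in range(n_iters):
        # Step 2: generate N samples according to the current law.
        samples = [[rng.choices(range(n_values), weights=p)[0] for p in probs]
                   for _ in range(n_samples)]
        # Step 3: keep the rho * N best samples for the reward criterion f.
        samples.sort(key=f, reverse=True)
        elite = samples[:n_elite]
        # Step 4: cross-entropy update; for independent categorical laws
        # this is the frequency of each value among the selected samples.
        for i in range(n_vars):
            counts = [0] * n_values
            for x in elite:
                counts[x[i]] += 1
            probs[i] = [c / n_elite for c in counts]
    return probs  # Step 5 is the loop above (fixed number of iterations)
```

For instance, with `f` counting the coordinates of x which agree with a hidden target vector, the returned law concentrates on that target after a few iterations. With such a plain frequency update a value whose probability reaches zero is never sampled again, which is one reason why the choice of the family and of the selective rate matters.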

Interpretation. The CE algorithm tightens the law $P_\sigma$ around the
maximizer of f. Thus, when the probabilistic family P is well suited to the
maximization of f, finding a maximizer for f becomes equivalent to
optimizing the parameter σ by means of the CE algorithm. The problem is
to find a good family. . . Another issue is the criterion for deciding the con- 
vergence. Some answers are given in [£]. Now, it is outside the scope of 
this paper to investigate these questions precisely. Our criterion was to stop
after a given number of successive unsuccessful tries, and this very simple
method has worked well on our problem.

4.2 Application 

Optimizing $h \in \mathcal{H}$ means tuning the parameter h in order to tighten the
probability P[h] around the optimal values for V. This is exactly what
the Cross-Entropy optimization method does. However, it is required that the
reward function V be easily computable. Typically, the definition of V may
be recursive, e.g.:

$$V(d,y,x) = V_T ; \quad V_t = v_t(d_t, y_t, x_t, V_{t-1}) \ \text{ and } \ V_0 = \emptyset .$$

Let the selective rate ρ be a positive number such that ρ < 1. The
cross-entropy method for optimizing h follows the synopsis:

1. Initialize h, for example with a flat h,

2. Build N samples $\theta_n = (d^n, y^n, x^n, m^n)$ according to the law P[h],

3. Choose the ρN best samples $\theta_n$ according to the reward $V(d^n, y^n, x^n)$.
Denote by S the set of the selected samples,

4. Update h as the minimizer of the cross-entropy with the selected samples:

$$h \in \arg\max_{h \in \mathcal{H}} \sum_{n \in S} \ln P[h](\theta_n) , \quad (2)$$

5. Reiterate from step 2 until convergence.



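For our HHMM model, the maximization (2) is solved by:

$$h^0(A|B) = \frac{\mathrm{card}\{n \in S, t \,/\, A = d^n_t \text{ and } B = m^{1,n}_t\}}{\mathrm{card}\{n \in S, t \,/\, B = m^{1,n}_t\}} ,$$

$$h^1(A|B,C) = \frac{\mathrm{card}\{n \in S, t \,/\, A = m^{1,n}_t ,\, B = y^n_t \text{ and } C = m^{2,n}_t\}}{\mathrm{card}\{n \in S, t \,/\, B = y^n_t \text{ and } C = m^{2,n}_t\}} ,$$

and for 2 ≤ λ ≤ Λ:

$$h^\lambda(A|B,C) = \frac{\mathrm{card}\{n \in S, t \,/\, A = m^{\lambda,n}_t ,\, B = m^{\lambda-1,n}_{t-1} \text{ and } C = m^{\lambda+1,n}_t\}}{\mathrm{card}\{n \in S, t \,/\, B = m^{\lambda-1,n}_{t-1} \text{ and } C = m^{\lambda+1,n}_t\}} .$$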
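
These updates are ratios of counts taken over the selected samples and over all stages. A minimal Python sketch of such a count-based update is given below for a conditional law with two conditioning variables. The representation of the selected samples as tuples of lists and the names `update_conditional` and `level1_events` are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def update_conditional(events):
    """Estimate h(A | B, C) from (a, b, c) triples by counting.

    `events` gathers one triple per selected sample n in S and per stage t.
    Returns table[(b, c)][a] = card{A=a, B=b, C=c} / card{B=b, C=c}.
    """
    events = list(events)
    joint = Counter(events)                          # numerator counts
    marginal = Counter((b, c) for _, b, c in events) # denominator counts
    table = defaultdict(dict)
    for (a, b, c), k in joint.items():
        table[(b, c)][a] = k / marginal[(b, c)]
    return dict(table)

def level1_events(selected):
    """Collect the triples (m1_t, y_t, m2_t) used to update h^1.

    Assumption: each selected sample is a tuple (d, y, m1, m2) of
    equal-length lists indexed by the stage t.
    """
    return [(m1_t, y_t, m2_t)
            for (_, y, m1, m2) in selected
            for y_t, m1_t, m2_t in zip(y, m1, m2)]
```

Conditions (b, c) which never occur among the selected samples get no entry in the returned table; a complete implementation has to decide what to do with them (for instance keeping the previous law), a choice which is not specified here.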



The next section presents an example of implementation of the algorithm
described in section 4.2.

5 Implementation 

The algorithm has been applied to a simulated target detection problem.

5.1 Problem setting

A target R is moving in a lattice of 20 × 20 cells, i.e. $[0,19]^2$. R is tracked
by two mobiles, B and C, controlled by the subject. The coordinates of R,
B and C at time t are denoted $(i_R^t, j_R^t)$, $(i_B^t, j_B^t)$ and $(i_C^t, j_C^t)$. B and C have
very limited information about the target position, and maneuver
much more slowly:

• A move for B (respectively C) is either: turn left, turn right, go forward,
or no move. Consequently, there are 4 × 4 = 16 possible actions
for the subject. These moves cannot be combined in a single turn. There is
no diagonal forward move: a mobile is directed either up, right, down or left,

• The mobiles are initially positioned in the bottom corners, i.e. $i_B = 0$,
$j_B = 19$ and $i_C = 19$, $j_C = 19$. The mobiles are initially directed
downward,

• B (respectively C) observes whether the target relative position is
forward or not. More precisely:

— when B is directed upward, it knows whether $j_R < j_B$ or not,

— when B is directed right, it knows whether $i_R > i_B$ or not,

— when B is directed downward, it knows whether $j_R > j_B$ or not,

— when B is directed left, it knows whether $i_R < i_B$ or not,

• B (respectively C) knows whether its distance to the target is less
than 3, i.e. $d_\infty(B,R) < 3$, or not. The distance $d_\infty$ is defined by:

$$d_\infty(B,R) = \max\{|i_B - i_R| , |j_B - j_R|\} .$$

In total, there are $2^4 = 16$ possible observations for the subject.
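
The observation rule of one mobile can be written compactly. The Python sketch below returns the two booleans observed by a mobile; positions are (i, j) pairs with j increasing downward, as suggested by the initial positions in the bottom corners. The function names, the string encoding of the directions and the `radius` parameter are assumptions made for the illustration.

```python
def d_inf(p, q):
    """Distance d_inf between two cells p = (i, j) and q = (i, j)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def observe(mobile, direction, target, radius=3):
    """Return (forward, near) for one mobile.

    forward: whether the target lies ahead of the mobile, according to
             the direction-dependent rule of the problem setting.
    near:    whether d_inf(mobile, target) < radius, as stated in the text.
    """
    (i_m, j_m), (i_r, j_r) = mobile, target
    forward = {
        "up": j_r < j_m,
        "right": i_r > i_m,
        "down": j_r > j_m,
        "left": i_r < i_m,
    }[direction]
    near = d_inf(mobile, target) < radius
    return forward, near
```

The observation of the subject is then the pair of the observations of B and C, that is four booleans and therefore 16 possible values.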

Several test cases have been considered. In case 1, the target R does not
move. In any other case, the target R chooses stochastically its next position
in its neighborhood. Any move is possible (up/down, left/right, diagonals,
no move). The probability of choosing a new position is proportional to the
sum of the squared distances from the mobiles:

$$P(R^{t+1}|R^t) = 0 \quad \text{if } |i_R^{t+1} - i_R^t| > 1 \text{ or } |j_R^{t+1} - j_R^t| > 1 ,$$

$$P(R^{t+1}|R^t) \propto (i_R^{t+1} - i_B^t)^2 + (j_R^{t+1} - j_B^t)^2 + (i_R^{t+1} - i_C^t)^2 + (j_R^{t+1} - j_C^t)^2 \quad \text{else.}$$
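
As a worked illustration of this moving rule, the following Python sketch computes the law of the next target position. It assumes, which the text does not state explicitly, that the target cannot leave the lattice; the function name `target_move_law` is hypothetical.

```python
def target_move_law(target, mob_b, mob_c, size=20):
    """Law of the next target cell given the current cells of R, B and C.

    Each neighbouring cell (including the current one) gets a weight equal
    to the sum of its squared Euclidean distances to the two mobiles.
    Assumption: cells outside the size x size lattice are excluded.
    """
    i_r, j_r = target
    weights = {}
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            i, j = i_r + di, j_r + dj
            if 0 <= i < size and 0 <= j < size:
                weights[(i, j)] = ((i - mob_b[0]) ** 2 + (j - mob_b[1]) ** 2
                                   + (i - mob_c[0]) ** 2 + (j - mob_c[1]) ** 2)
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}
```

With the target at (10, 10) and the mobiles in the bottom corners (0, 19) and (19, 19), the most probable next cell is (11, 9), with weight 385 against 381 for (9, 9) and (10, 9): the squared distances make the law only mildly biased, and the larger of the two distances dominates the sum.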

This definition was intended to favor escape moves: the greater a distance
is, the more probable the move is. But in such a summation, a short distance
is neglected compared to a long distance. This implies that a distant
mobile will hide a nearby mobile. This "deluding" property will actually
induce two different kinds of strategy within the learned machines.

The objective of the subject is to keep the target sufficiently close to
at least one mobile (in this example, the distance between the target and
a mobile is required to be not more than 3). More precisely, the reward
function V just counts the number of such "encounters":

$$V_0 = 0 ; \quad V_t = V_{t-1} + 1 \ \text{ if } d_\infty(B_t, R_t) \le 3 \text{ or } d_\infty(C_t, R_t) \le 3 ; \quad V_t = V_{t-1} \ \text{ else.}$$

The total number of turns is T = 100.
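
The reward of a complete trajectory can be computed by a simple count, as in the Python sketch below. The threshold is taken as "not more than 3", following the sentence above; the function names and the representation of trajectories as lists of (i, j) cells are assumptions.

```python
def d_inf(p, q):
    """Distance d_inf between two cells p = (i, j) and q = (i, j)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def encounter_reward(traj_b, traj_c, traj_r, radius=3):
    """Count the stages where at least one mobile is close to the target.

    traj_b, traj_c, traj_r: positions of B, C and R, one cell per stage.
    A stage counts when d_inf(B, R) <= radius or d_inf(C, R) <= radius.
    """
    return sum(1 for b, c, r in zip(traj_b, traj_c, traj_r)
               if d_inf(b, r) <= radius or d_inf(c, r) <= radius)
```

This reward is additive over the stages, but the method does not rely on that property: any quickly computable function of the trajectories could be used instead.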

5.2 Results 

Generality. Like many stochastic algorithms, this algorithm needs some
time to converge. For the considered example, about two hours were
needed for convergence (on a 2GHz PC); the selective rate was ρ = 0.5.
This speed depends on the size of the HHMM model and on the convergence
criterion. A weak and a strong criterion are used for deciding the
convergence. With the weak criterion, the algorithm is terminated after
100 successive unsuccessful tries. With the strong criterion, the algorithm
is terminated after 500 successive unsuccessful tries. Of course, the strong
criterion computes a (slightly) better optimum than the weak criterion, but
it needs more time. Because of the many tested examples, the weak criterion
has been the most used, in particular for the big models. For the same HHMM
model, the computed optimal values do not depend on the algorithmic
instance (small variations result however from the stochastic nature of the
algorithm).

In the sequel, mean rewards are rounded to the nearest integer, or are
expressed as a percentage of the optimum, which makes the presentation
clearer. Owing to the small variations of this stochastic algorithm,
more precision turns out to be irrelevant.

Case 1: R does not move. This example has been considered in order
to test the algorithm. The position of the target is fixed in the center of the
square space, i.e. $i_R = j_R = 10$. It is recalled that the mobiles are initially
directed downward. Then, the optimal strategy is known and its value is 85:
the time needed to reach the target is 15, and no further move is needed.
The learned $h_o$ approximates the reward 84. The convergence is good.

Case 2: R is moving but the observation y is hidden. Initially, R
is located within the 20 × 10 upper cells of the lattice (i.e. [0,19] × [0,9]),
according to a uniform probabilistic law. The computed optimal mean
reward is about 32. In this case, the mobiles tend to move towards the upper
corners.

Case 3: R is moving and y is observed. Again, $R^1$ is located uniformly
within the 20 × 10 upper cells of the lattice. The computed optimal mean
reward is about 69. This reward has been obtained from a large HHMM
model (Λ = 2 with 256 states per level, i.e. card($M^\lambda$) = 256) and with the
strong criterion. However, somewhat smaller models should work as well.

Specific computations are now presented, depending on the number of levels
Λ and the number of states per level. For each case, the weak criterion has
been used. The rewards are now expressed as percentages.

Subcase Λ = 1. For such a model, the action $d_t$ is constructed only from the
immediate last observation $y_t$. The model does not keep any memory of
the past observations. Then, only 16 states are sufficient to describe the
hidden variable $m^1$, i.e. card($M^1$) = 16. The resulting reward is 78% of the
optimum.

Subcase Λ = 2. This model is equivalent to an HMM and it is assumed that
card($M^1$) = card($M^2$). The following table gives the computed reward for
several choices of the memory size:



card($M^\lambda$)    16     32     64     256
Reward       94%    96%    97%    97%

It is noteworthy that the memory of the past observations allows better
strategies than the last observation alone (case Λ = 1). Indeed, the reward
jumps from 78% up to 97%.

Subcases Λ > 2. A comparison of graduated hierarchic models, 1 ≤ Λ ≤ 4,
has been made. The first level contained 16 possible states, and the higher
levels were restricted to 2 states:



hierarchic grade             Λ = 1    Λ = 2    Λ = 3     Λ = 4
card($M^\lambda$), λ = 1..Λ       16       16,2     16,2,2    16,2,2,2

The test has been carried out according to the weak criterion:

hierarchic grade    Λ = 1    Λ = 2    Λ = 3    Λ = 4
Reward (weak)       78%      85%      81%      94%

and the strong criterion:

hierarchic grade    Λ = 1    Λ = 2    Λ = 3    Λ = 4
Reward (strong)     80%      88%      93%      96%

It seems that a high hierarchic grade (i.e. more structure) makes the
convergence difficult. This is particularly the case here for the grade Λ = 3,
which reached only 81% under the weak criterion. However, the algorithm
works again when the convergence criterion is strengthened.



It is interesting to make a comparison with the subcase Λ = 2 where
card($M^1$) = card($M^2$) = 16. Under the weak criterion, the result for this
HHMM was 94%, as for the grade Λ = 4. However, the dimension of the law
is quite different for the two models:

• 15 × 16 + 15 × 16 × 16 + 15 × 16 = 4320 for the 2-level HHMM,

• 15 × 16 + 15 × 16 × 2 + 1 × 16 × 2 + 1 × 2 × 2 + 1 × 2 = 758 for the
4-level HHMM.
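
These two counts can be reproduced by counting the free parameters of each conditional law: a law over k values has k − 1 free parameters for each value of its conditioning variables. The Python sketch below does this for the HHMM family considered here, with 16 decisions and 16 observations; the function name `hhmm_dimension` is hypothetical, and the empty memory above the top level is counted as a single conditioning value.

```python
def hhmm_dimension(level_sizes, n_decisions=16, n_observations=16):
    """Number of free parameters of an HHMM policy.

    level_sizes: [card(M^1), ..., card(M^Lambda)].
    """
    sizes = list(level_sizes) + [1]   # empty memory above the top level
    # h^0(d_t | m^1_t)
    dim = (n_decisions - 1) * sizes[0]
    # h^1(m^1_t | y_t, m^2_t)
    dim += (sizes[0] - 1) * n_observations * sizes[1]
    # h^lambda(m^lambda_t | m^(lambda-1)_(t-1), m^(lambda+1)_t), lambda >= 2
    for k in range(1, len(level_sizes)):
        dim += (sizes[k] - 1) * sizes[k - 1] * sizes[k + 1]
    return dim

print(hhmm_dimension([16, 16]))       # 2-level HHMM
print(hhmm_dimension([16, 2, 2, 2]))  # 4-level HHMM
```

The two printed values are 4320 and 758, matching the sums above term by term.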

This dimension is a rough characterization of the complexity of the model. It
seems clear from these examples that the highly hierarchized models are more
efficient than the weakly hierarchized models: the 4-level model reaches the
same reward with far fewer parameters. And the problem considered
here is quite simple. On complex problems, hierarchical models may be
pre-eminent.

Global behavior. 

The algorithm. The convergence speed is low at the beginning. After this
initial stage, it improves greatly until it reaches a new "waiting" stage. This
alternation of low-speed and high-speed stages has been noticed several
times.

The near optimal policy. The behaviour of the best found policy is now
discussed. This policy has reached the mean reward 69. The mobiles'
strategy results in a tracking of the target. Figure 6 illustrates a short
sequence of escape/tracking of the target. Two quite distinct behaviours
have been noticed among the many runs of the policy:

• The two mobiles may both cooperate on tracking the target,

• When the target is near a border, one mobile may stay along the opposite
border while the other mobile performs the tracking. This
strategy seems strange at first sight. But it is recalled that the moving
rule of the target tends to neglect a nearby mobile compared to a distant
mobile. In this strategy, the first mobile is just annihilating the
ability of the target to escape from the tracking of the second mobile.

Figure 6: Near-optimal control sequence
(legend: x = target, • = observer 1, o = observer 2; relative times are put in superscript)

5.3 Comparison with the Q-learning 

Q-learning is a reinforcement learning method, which is based on the
computation of a table evaluating the decision conditionally on the known
information. The known information is typically the state of the world if it
is known, or partial states and observations. Since the size of the known
information increases exponentially with the observation range, the test will
only implement a Q-learning based on the immediate past observation. Now, let
us recall some theoretical grounds about Q-learning.

Theory. A founding reference about reinforcement learning is the well-known
book of Sutton and Barto, which is available on the internet. This
paragraph will not enter deeply into the subject, and is limited to a simple
description of Q-learning. Moreover, we will make the hypothesis of an
infinite horizon (that is T = ∞) with a weak discounting of the reward,
γ = 0.99, so as to implement the algorithm in its most classical form. Tests
have also been made with a finite horizon, but have not achieved a
good convergence for the considered algorithm.

The learning relies on the following hypotheses:

• At each step t, the subject has a (partial) knowledge s of the state of
the world, and chooses an action a,

• Let $V_{t+1}$ be the cumulated reward from step t + 1 to step ∞. Assume
a state $s_t$ and action $a_t$ at step t. Then $V_t = R(s_t, a_t) + \gamma V_{t+1}$, i.e. an
instantaneous reward R is obtained and added to the discounted
future reward.

The question is: given a current state s, what is the best action a to
take? The answer is simple if we are able to predict the future and
evaluate the expected cumulated reward Q(s,a) for any a: the best action
is $a_o \in \arg\max_a Q(s,a)$. The following algorithm could be used for learning
the table Q (taken from Sutton and Barto):

• Initialize $Q(s,a)$ arbitrarily

• (Repeat for each episode: [finite-horizon case])

  – Initialize $s$

  – Repeat for each step (of the episode):

    * With probability $1 - \epsilon$ choose $a \in \arg\max_a Q(s,a)$;
      otherwise choose $a$ randomly

    * Take action $a$, receive reward $R(s,a)$ and observe the new
      state $s'$

    * Set $Q(s,a) := Q(s,a) + \alpha \left( R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$

    * Set $s := s'$

  – (until $s$ is terminal)

where $\alpha$ controls the convergence speed and $\epsilon$ the innovation
(exploration).
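
The steps above can be sketched as follows for the infinite-horizon case, where a single trajectory is followed without episodes. This is a minimal sketch and not the code used for the experiments: the environment callback `step`, the dictionary storage of the table and the constant exploration rate are all assumptions made for illustration.

```python
import random

def q_learning(step, start_state, actions, n_steps,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy choice of the action.

    step(s, a) -> (reward, next_state) is a hypothetical environment
    callback. Q is stored as a dict keyed by (state, action).
    """
    rng = random.Random(seed)
    q = {}  # missing entries are read as 0.0 (arbitrary initialization)
    s = start_state
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = rng.choice(actions)  # explore: random action
        else:
            # exploit: greedy action for the current table
            a = max(actions, key=lambda b: q.get((s, b), 0.0))
        reward, s_next = step(s, a)
        best_next = max(q.get((s_next, b), 0.0) for b in actions)
        old = q.get((s, a), 0.0)
        # Q(s,a) := Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
        q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
        s = s_next
    return q
```

A dictionary is used instead of a dense array so that only visited pairs consume memory; a dense table, as suggested by the memory figure reported for the experiment, would be indexed directly by the encoded state.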

In our implementation, $s$ is the tuple made of the last observation $y_t$
and of the known components of the world state (including the directions),
$a = d_t$, $\alpha = 0.1$, $\epsilon = 1/\ln i$, and the instantaneous
reward $R$ is compliant with the experiment definition of the previous
section. Since $s$ contains the last observation plus the known part of the
world state, this experiment should be equivalent to [case 3/subcase A = 1]
considered previously. The computer memory needed to store the table $Q$
was approximately 2 gigabytes, which is close to the limits of the
computer. In particular, it is rather difficult to handle a larger
observation range without some approximations.
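
To give an idea of why a larger observation range is out of reach for a plain table, here is a rough sketch of the memory count. The function name `q_table_bytes`, the 8 bytes per entry and every number passed to it are hypothetical; they are not the sizes of the experiment.

```python
def q_table_bytes(n_known, n_obs, n_actions, obs_range=1, bytes_per_entry=8):
    """Bytes needed for a dense table Q(s, a).

    Assumption: s combines the known part of the world state (n_known
    possible values) with the last `obs_range` observations (n_obs
    possible values each), so the number of states is
    n_known * n_obs ** obs_range.
    """
    n_states = n_known * n_obs ** obs_range
    return n_states * n_actions * bytes_per_entry
```

Each additional past observation multiplies the table size by the number of possible observations, which is the exponential growth mentioned at the beginning of this section.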

Results. The algorithm was stopped after $10^{11}$ iterations, although
$10^{10}$ seemed sufficient. It took several hours, but the implementation
has not been optimized. In order to make the comparison with our method
possible, the Q-strategies have been evaluated by a non-discounted
cumulation of the reward over 100-step-wide windows. Moreover, these
evaluations have been made:

• from the initial stage of the simulation, so as to conform to the
previous section,

• after many cycles, so as to simulate an infinite horizon.
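
The two evaluation modes can be sketched with a single function, assuming a deterministic policy and an environment callback. The names `windowed_evaluation`, `step`, `policy` and `burn_in` are hypothetical and only serve the illustration.

```python
def windowed_evaluation(step, policy, start_state, window=100, burn_in=0):
    """Non-discounted sum of the rewards over `window` consecutive steps.

    burn_in = 0 evaluates from the initial stage; a large burn_in lets
    the system run first, which mimics an infinite-horizon regime.
    step(s, a) -> (reward, next_state) and policy(s) -> a are
    hypothetical callbacks.
    """
    s = start_state
    for _ in range(burn_in):  # rewards are ignored during the burn-in
        _, s = step(s, policy(s))
    total = 0.0
    for _ in range(window):
        reward, s = step(s, policy(s))
        total += reward
    return total
```

Repeating such an evaluation over many simulated runs would give the worst, mean and best figures of a policy for each mode.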

The following table makes a comparison between the Q-strategies and the
model-based strategies with A = 1.

                            worst     mean     best
Q-policy/stage                 0%      40%     112%
Q-policy/$\infty$-horizon      0%      51%     145%
Model based A = 1             44%      78%     105%

It is first noticed that the policy obtained by Q-learning is less
regulated than the model-based policy. Moreover, although it may be quite
good at tracking a target once the encounter has been initiated (best is
145%), it is rather bad at initiating the encounter (mean for the initial
stage is 40%) or when the tracking is lost (worst is 0%). Finally, the mean
evaluation at infinite horizon is 51%, which is even lower than that of the
model-based policy working from the initial stage (78%).

On this example, and for this simple Q-learning implementation, the
comparison is favorable to the model-based policy. Moreover, model-based
policies are able to handle a larger observation range. Note, however, that
this planning example has been constructed so as to make the management of
the state variables (the dimension is huge) and of the observations (the
observations are poor and have to be combined) difficult. For such a
problem, a more dedicated RL method should be chosen.
6 Conclusion



In this paper, we proposed a general method for approximating optimal
planning in a partially observable world. Hierarchical HMM families have
been used for approximating the optimal decision tree, and the
approximation has been optimized by means of the Cross-Entropy method.

At this time, the method has been applied to a strictly discrete-state
problem and has been seen to work properly. The algorithm has been compared
favorably with a Q-learning implementation on the considered problem: it is
able to handle a larger observation range, and the optimized policy is more
regulated. An interesting point is that the optimized policy has discovered
two quite different global strategies and is able to choose between them:
either make both mobiles cooperate in tracking, or use one mobile to delude
the target.

The results are promising. However, the observation and action spaces are
limited to a small number of states. What happens if the hidden space
becomes much more intricate? There are several possible answers to such
difficulties:

First, the cross-entropic principle could be applied to the optimization of
continuous laws. It is thus certainly possible to consider semi-continuous
models, which would be more realistic for a planning policy. Secondly, many
refinements of the structure of the models are foreseeable. Hierarchical
models for observation, decision and memory should be improved in order to
locally factorize intricate problems. This research is only preliminary,
and future work should investigate these questions.